Goto

Collaborating Authors

 sparse gradient


Differentially Private Optimization with Sparse Gradients

Neural Information Processing Systems

Motivated by applications of large embedding models, we study differentially private (DP) optimization problems under sparsity of individual gradients. We start with new near-optimal bounds for the classic mean estimation problem but with sparse data, improving upon existing algorithms particularly for the high-dimensional regime. The corresponding lower bounds are based on a novel block-diagonal construction that is combined with existing DP mean estimation lower bounds.Next, we obtain pure- and approximate-DP algorithms with almost optimal rates for stochastic convex optimization with sparse gradients; the former represents the first nearly dimension-independent rates for this problem. Furthermore, by introducing novel analyses of bias reduction in mean estimation and randomly-stopped biased SGD we obtain nearly dimension-independent rates for near-stationary points for the empirical risk in nonconvex settings under approximate-DP.


Grad Queue : A probabilistic framework to reinforce sparse gradients

Hasib, Irfan Mohammad Al

arXiv.org Artificial Intelligence

Informative gradients are often lost in large batch updates. We propose a robust mechanism to reinforce the sparse components within a random batch of data points. A finite queue of online gradients is used to determine their expected instantaneous statistics. We propose a function to measure the scarcity of incoming gradients using these statistics and establish the theoretical ground of this mechanism. To minimize conflicting components within large mini-batches, samples are grouped with aligned objectives by clustering based on inherent feature space. Sparsity is measured for each centroid and weighted accordingly. A strong intuitive criterion to squeeze out redundant information from each cluster is the backbone of the system. It makes rare information indifferent to aggressive momentum also exhibits superior performance with larger mini-batch horizon. The effective length of the queue kept variable to follow the local loss pattern. The contribution of our method is to restore intra-mini-batch diversity at the same time widening the optimal batch boundary. Both of these collectively drive it deeper towards the minima. Our method has shown superior performance for CIFAR10, MNIST, and Reuters News category dataset compared to mini-batch gradient descent.


Asymmetric Momentum: A Rethinking of Gradient Descent

Zhang, Gongyue, Zhang, Dinghuang, Zhao, Shuwen, Liu, Donghan, Toptan, Carrie M., Liu, Honghai

arXiv.org Artificial Intelligence

Through theoretical and experimental validation, unlike all existing adaptive methods like Adam which penalize frequently-changing parameters and are only applicable to sparse gradients, we propose the simplest SGD enhanced method, Loss-Controlled Asymmetric Momentum(LCAM). By averaging the loss, we divide training process into different loss phases and using different momentum. It not only can accelerates slow-changing parameters for sparse gradients, similar to adaptive optimizers, but also can choose to accelerates frequently-changing parameters for non-sparse gradients, thus being adaptable to all types of datasets. We reinterpret the machine learning training process through the concepts of weight coupling and weight traction, and experimentally validate that weights have directional specificity, which are correlated with the specificity of the dataset. Thus interestingly, we observe that in non-sparse gradients, frequently-changing parameters should actually be accelerated, which is completely opposite to traditional adaptive perspectives. Compared to traditional SGD with momentum, this algorithm separates the weights without additional computational costs. It is noteworthy that this method relies on the network's ability to extract complex features. We primarily use Wide Residual Networks for our research, employing the classic datasets Cifar10 and Cifar100 to test the ability for feature separation and conclude phenomena that are much more important than just accuracy rates. Finally, compared to classic SGD tuning methods, while using WRN on these two datasets and with nearly half the training epochs, we achieve equal or better test accuracy.


SparDL: Distributed Deep Learning Training with Efficient Sparse Communication

Zhao, Minjun, Yin, Yichen, Mao, Yuren, Chen, Lu, Gao, Yunjun

arXiv.org Artificial Intelligence

Top-$k$ sparsification has recently been widely used to reduce the communication volume in distributed deep learning; however, due to Gradient Accumulation (GA) dilemma, the performance of top-$k$ sparsification is still limited. Several methods have been proposed to handle the GA dilemma but have two drawbacks: (1) they are frustrated by the high communication complexity as they introduce a large amount of extra transmission; (2) they are not flexible for non-power-of-two numbers of workers. To solve these two problems, we propose a flexible and efficient sparse communication framework, dubbed SparDL. SparDL uses the Spar-Reduce-Scatter algorithm to solve the GA dilemma without additional communication operations and is flexible to any number of workers. Besides, to further reduce the communication complexity and adjust the proportion of latency and bandwidth cost in communication complexity, we propose the Spar-All-Gather algorithm as part of SparDL. Extensive experiments validate the superiority of SparDL.


Adam

#artificialintelligence

Training a neural network consist of optimizing stochastic functions in a high-dimensional space. In this context, finding a computationally and memory-efficient method with fast convergence properties is hard. On one hand, the objective functions encounter in Deep Learning are in most cases non-convex and non-stationary. On the other hand, gradients can be noisy and/or sparse. For efficient stochastic optimization, Adam, just like RMSProp and AdaGrad, rely only on first-order gradients.


S2 Reducer: High-Performance Sparse Communication to Accelerate Distributed Deep Learning

Ge, Keshi, Fu, Yongquan, Lai, Zhiquan, Deng, Xiaoge, Li, Dongsheng

arXiv.org Artificial Intelligence

Distributed stochastic gradient descent (SGD) approach has been widely used in large-scale deep learning, and the gradient collective method is vital to ensure the training scalability of the distributed deep learning system. Collective communication such as AllReduce has been widely adopted for the distributed SGD process to reduce the communication time. However, AllReduce incurs large bandwidth resources while most gradients are sparse in many cases since many gradient values are zeros and should be efficiently compressed for bandwidth saving. To reduce the sparse gradient communication overhead, we propose Sparse-Sketch Reducer (S2 Reducer), a novel sketch-based sparse gradient aggregation method with convergence guarantees. S2 Reducer reduces the communication cost by only compressing the non-zero gradients with count-sketch and bitmap, and enables the efficient AllReduce operators for parallel SGD training. We perform extensive evaluation against four state-of-the-art methods over five training models. Our results show that S2 Reducer converges to the same accuracy, reduces 81\% sparse communication overhead, and achieves 1.8$ \times $ speedup compared to state-of-the-art approaches.


Standardizing a Machine Learning Framework for Applied Research

#artificialintelligence

Until now, the Machine Learning (ML) frameworks we've used at Borealis AI have varied according to individual preference. But as our applied team grows, we're finding that a preference-based system has certain shortcomings that have led to inefficiencies and delays in our research projects. As a result, we identified two main arguments in favour of standardizing a single framework for the lab. It has been our experience that independent frameworks do not often "play well" together. For example, a TensorFlow-based model applied to one research project would have to be rewritten in PyTorch for another project.


Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients

Schaul, Tom, LeCun, Yann

arXiv.org Artificial Intelligence

Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes all needs for tuning, while automatically reducing learning rates over time on stationary problems, and permitting learning rates to grow appropriately in non-stationary tasks. Here, we extend the idea in three directions, addressing proper minibatch parallelization, including reweighted updates for sparse or orthogonal gradients, improving robustness on non-smooth loss functions, in the process replacing the diagonal Hessian estimation procedure that may not always be available by a robust finite-difference approximation. The final algorithm integrates all these components, has linear complexity and is hyper-parameter free.


Learning sparse gradients for variable selection and dimension reduction

Ye, Gui-Bo, Xie, Xiaohui

arXiv.org Machine Learning

Variable selection and dimension reduction are two commonly adopted approaches for high-dimensional data analysis, but have traditionally been treated separately. Here we propose an integrated approach, called sparse gradient learning (SGL), for variable selection and dimension reduction via learning the gradients of the prediction function directly from samples. By imposing a sparsity constraint on the gradients, variable selection is achieved by selecting variables corresponding to non-zero partial derivatives, and effective dimensions are extracted based on the eigenvectors of the derived sparse empirical gradient covariance matrix. An error analysis is given for the convergence of the estimated gradients to the true ones in both the Euclidean and the manifold setting. We also develop an efficient forward-backward splitting algorithm to solve the SGL problem, making the framework practically scalable for medium or large datasets. The utility of SGL for variable selection and feature extraction is explicitly given and illustrated on artificial data as well as real-world examples. The main advantages of our method include variable selection for both linear and nonlinear predictions, effective dimension reduction with sparse loadings, and an efficient algorithm for large p, small n problems.